By iandix, May 2018

TLDR

In this tutorial we’ll build a Neural Network architecture known as Multi-Layer Perceptron to predict the species of a flower based on some of its physical attributes. We’ll do this with the R version of the very popular Keras API.

R Interface to Keras

Keras is a high-level neural networks API developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research. Keras has the following key features:

  1. Allows the same code to run seamlessly on CPU or GPU.
  2. User-friendly API which makes it easy to quickly prototype deep learning models.
  3. Built-in support for convolutional networks (for computer vision), recurrent networks (for sequence processing), and any combination of both.
  4. Supports arbitrary network architectures: multi-input or multi-output models, layer sharing, model sharing, and so on.

Keras was originally developed as a Python library and was only recently ported to R, with about the same functionality as the original version.

Keras Built-in Datasets

The Keras package already comes with the following built-in datasets:

  1. CIFAR10 and CIFAR100 (small color images)
  2. IMDB movie reviews (sentiment classification)
  3. Reuters newswire topics
  4. MNIST handwritten digits
  5. Fashion-MNIST clothing images
  6. Boston housing prices

Although these datasets are quite helpful for learners and well known in the Data Science community, for this specific tutorial we’ll choose another one which is really simple and easy to work with: the Iris Dataset.

The Iris Dataset

This is perhaps the best known database to be found in the pattern recognition literature. It is a classic in the field, structured by R. A. Fisher in 1936 and still frequently referenced to this day. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. The predicted attribute is the class of iris plant.

Most of the dataset attributes are measurements of the Iris flowers' sepals and petals:

  1. Sepal length in cm
  2. Sepal width in cm
  3. Petal length in cm
  4. Petal width in cm
  5. Class (text)

The sepals are the outer parts of the flower (often green and leaf-like) that enclose a developing bud. In most cases sepals remain green and form the outer whorl of the flower; their basic function is to protect the male and female sex organs of the flower.

The petals are the parts of a flower that are often conspicuously colored. In most cases the petals lie just inside the sepals and function as attractants: they are variously colored and attract bees, birds, etc., so that pollination can take place.

Class is related to the flower species and can assume three varieties: Iris Setosa, Iris Versicolour and Iris Virginica.

For the Iris flowers, the sepals and petals are both colorful and not easily distinguishable, as one can see in the following picture:

The Whole Process

We all know nowadays Machine Learning is the icing on the cake. After all, as evangelized by the legendary Andrew Ng, Artificial Intelligence is the New Electricity. To get there, though, it's necessary to go through a laborious process. In R, the steps of this process are as follows:

  1. Get the Data from a Data Source
  2. Load the Data into R
  3. Explore and Get Familiarized with the Data
  4. Preprocess the Data
  5. Split the Data into Train and Test Sets
  6. Build the Machine Learning Model
  7. Evaluate the Performance of the Model
  8. Compile and Train the Model
  9. Use the Model to Predict with New Data
  10. Fine-Tune the Model: Hyperparameters
  11. Persist the Model Final Version

Getting and Loading the Data

So let’s load the Iris data from the University of California, Irvine (UCI) archive. To do this we’ll read the dataset CSV file directly from a URL and then run some basic instructions to verify that the import process was successful:

# Load the Iris Dataset
iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE)

# Verify the first few items
head(iris)
# Inspect data structure
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ V1: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ V2: num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ V3: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ V4: num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ V5: Factor w/ 3 levels "Iris-setosa",..: 1 1 1 1 1 1 1 1 1 1 ...
# Check Dimensions
dim(iris)
## [1] 150   5

Exploring the Data

As shown by the ‘str’ function, the imported CSV data has multiple types. The columns V1-V4 are numeric (doubles) while the column V5 is a ‘Factor’. This tabular structure is called a Data Frame.

In order to allow a more didactic exploration, let’s add names to the ‘data.frame’ columns:

# Adding names to columns
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

One interesting exploration is to verify the correlations among attributes by means of a Correlogram: a graphical representation of a correlation matrix. It is very useful for highlighting the most correlated variables in a data table. In this plot, correlation coefficients are colored according to their values.

library(corrplot)
## corrplot 0.84 loaded
# Computing the correlation matrix
M <- cor(iris[,1:4])

# Plotting the correlation matrix
corrplot(M, method = "circle")

corrplot(M, method = "number")

We can see visually or numerically that some attributes have a positive correlation (shades of blue) whereas other ones have a negative correlation (shades of red).

To further investigate, let’s plot an exemplary positive and an exemplary negative correlation using scatter plots:

library(ggvis)

# Show positive correlation of 0.96 between petal length and width
iris %>% ggvis(~Petal.Length, ~Petal.Width, fill = ~Species) %>%
         layer_points()
# Show negative correlation of -0.11 between sepal length and width
iris %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>%
         layer_points()

Preprocessing the Data

Working with the data in a Data Frame format is quite convenient for exploratory purposes. However, to process the data with the Keras library we’ll need to convert it to a purely numeric representation:

# Converting from Factor labels to numbers
iris[,5] <- as.numeric(iris[,5]) - 1

# Turning Iris into a matrix (matrices are purely numeric)
iris <- as.matrix(iris)

# Removing column labels by setting Iris 'dimnames' to 'NULL'
dimnames(iris) <- NULL

The ‘Species’ attribute (column 5) has data type ‘Factor’. Factors are categorical variables that are super useful in summary statistics, plots, and regressions. Here we’re converting “Iris-setosa”, “Iris-versicolor”, and “Iris-virginica” to the numeric values 0, 1, and 2.
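This off-by-one mapping works because factor levels are stored as integer codes starting at 1 (with levels sorted alphabetically), so subtracting 1 yields 0-based classes. A quick base-R check on a toy factor:

```r
# Factor levels are sorted alphabetically and coded 1, 2, 3, ...
species <- factor(c("Iris-versicolor", "Iris-setosa", "Iris-virginica"))
levels(species)          # "Iris-setosa" "Iris-versicolor" "Iris-virginica"
as.numeric(species) - 1  # 1 0 2
```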

Next, we can proceed to verify the effect of normalizing the data. Data normalization (min-max scaling) is the process of rescaling one or more attributes to the range 0 to 1, so that the largest value of each attribute is 1 and the smallest is 0. Note that the keras ‘normalize’ function used below actually rescales each sample to unit norm (L2 by default), which keeps the values within [0,1] for nonnegative data but is not exactly min-max scaling.
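Min-max rescaling as just defined can be sketched directly in base R, for comparison with the keras ‘normalize’ call used below:

```r
# Min-max rescaling: map each column to the [0, 1] interval
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

m <- cbind(a = c(2, 4, 6), b = c(10, 20, 30))
apply(m, 2, min_max)
#        a   b
# [1,] 0.0 0.0
# [2,] 0.5 0.5
# [3,] 1.0 1.0
```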

library(keras)

# Summarizing the data before normalization
summary(iris)
##        V1              V2              V3              V4       
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        V5   
##  Min.   :0  
##  1st Qu.:0  
##  Median :1  
##  Mean   :1  
##  3rd Qu.:2  
##  Max.   :2
hist(iris, main = "Before Normalization")

# Normalizing the data
iris_norm <- normalize(iris[,1:4], -1, 2)

# Summarizing the new normalized dataset
summary(iris_norm)
##        V1               V2               V3               V4         
##  Min.   :0.6539   Min.   :0.2384   Min.   :0.1678   Min.   :0.01473  
##  1st Qu.:0.7153   1st Qu.:0.3267   1st Qu.:0.2509   1st Qu.:0.04873  
##  Median :0.7549   Median :0.3544   Median :0.5364   Median :0.16415  
##  Mean   :0.7516   Mean   :0.4048   Mean   :0.4550   Mean   :0.14096  
##  3rd Qu.:0.7884   3rd Qu.:0.5252   3rd Qu.:0.5800   3rd Qu.:0.19753  
##  Max.   :0.8609   Max.   :0.6071   Max.   :0.6370   Max.   :0.28042
hist(iris_norm, main = "After Normalization")

As shown by the histogram entitled ‘After Normalization’, all the data now lies in the interval [0,1] (although, since ‘normalize’ rescales each row to unit norm, the per-column minima and maxima are not exactly 0 and 1).

Another common operation is to standardize our numeric features. Data Standardization is the process of rescaling one or more attributes so that they have a mean value of 0 and a standard deviation of 1. Standardization assumes that your data has a Gaussian (bell curve) distribution. This does not strictly have to be true, but the technique is more effective if your attribute distribution is Gaussian.
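Base R provides standardization out of the box via the ‘scale’ function; a minimal sketch:

```r
# Standardization: center to mean 0 and rescale to standard deviation 1
x <- c(10, 20, 30, 40)
z <- scale(x)      # returns a one-column matrix

mean(z)            # ~0
sd(as.vector(z))   # 1
```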

We won’t standardize our Iris data. Actually, since our data already spans a fairly restricted range, there wouldn’t even be a strong reason to normalize it in the first place.

The usual motivations to normalize the features/attributes are:

  1. Attributes measured on very different scales, where the larger-ranged ones would otherwise dominate the learning process.
  2. Faster and more stable convergence of gradient-based training.
  3. Algorithms that rely on distances between samples, which are sensitive to scale.

Splitting Training and Test Sets

Based on the previous conclusions, we’ll keep working with the original Iris data, without normalizing it.

It’s time to organize the data into training and test sets. R has plenty of utility functions to help with this task:

# Create a vector of indices 1 and 2 with 2/3 of 1's and 1/3 of 2's
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))

# Use indices to select data and compose training and test sets
iris.training.X <- iris[ind == 1, 1:4]
iris.test.X <- iris[ind == 2, 1:4]

# Also, split the labeled attribute
iris.training.y <- iris[ind == 1, 5]
iris.test.y <- iris[ind == 2, 5]

The ‘sample’ function draws from the vector 1:2 as many items as there are rows in the dataset, randomly generating an index vector containing about 2/3 of 1’s and 1/3 of 2’s. Setting ‘replace’ to TRUE means sampling with replacement: after each draw the whole set 1:2 is preserved, so elements are not removed from the original vector.

Then we use a vectorized comparison to generate a logical vector that serves as an index:

head(ind)
## [1] 1 1 1 2 1 1
head(ind == 2)
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE
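One practical note: since ‘sample’ is random, fixing the seed beforehand makes the split reproducible across runs. A sketch:

```r
# Fixing the seed makes the random train/test split reproducible
set.seed(42)
ind <- sample(2, 150, replace = TRUE, prob = c(0.67, 0.33))

length(ind)   # 150
table(ind)    # roughly 2/3 ones and 1/3 twos
```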

Now that we have our training and test sets organized let’s enter the Machine Learning party.

Constructing the Machine Learning Model

Our motivating question was: How can we predict the Iris flower species (Versicolor, Setosa or Virginica) based on its sepal and petal attributes?

In order to be able to make such predictions we’ll now build the Machine Learning Model. One of the central Keras concepts is that of a Sequential Model: a linear stack of layers. To create a sequential model and add some layers:

library(keras)

# Initializing the sequential model
model <- keras_model_sequential()

# Adding layers to the model
model %>%
  layer_dense(units = 8, activation = 'relu', input_shape = c(4)) %>%
  layer_dense(units = 3, activation = 'softmax')

Note: the pipe operator in R (%>%) is quite common these days and avoids a lot of nested parentheses when chaining function calls. For a more detailed discussion of the pipe operator, read the article Pipes in R: Tutorial for Beginners.
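A minimal illustration of the pipe (the %>% operator comes from the magrittr package):

```r
library(magrittr)

# Nested call:  sum(sqrt(c(1, 4, 9)))
# Same thing, piped, reading left to right:
c(1, 4, 9) %>% sqrt() %>% sum()   # 6
```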

In the Iris case, we are constructing a Multi-Layer Perceptron, which is a kind of Feed-Forward Neural Network. It will work as a multi-class classifier.

Our NN will have:

  1. An input layer expecting 4 features (sepal length, sepal width, petal length, petal width)
  2. One hidden dense layer with 8 units and ReLU activation
  3. An output dense layer with 3 units (one per species) and softmax activation

Having created the Sequential Model, we can at any time get information about it using some utility functions:

# Printing summary info for the model
summary(model)
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## dense_1 (Dense)                  (None, 8)                     40          
## ___________________________________________________________________________
## dense_2 (Dense)                  (None, 3)                     27          
## ===========================================================================
## Total params: 67
## Trainable params: 67
## Non-trainable params: 0
## ___________________________________________________________________________
# Getting model configuration
get_config(model)
## [{'class_name': 'Dense', 'config': {'kernel_initializer': {'class_name': 'VarianceScaling', 'config': {'distribution': 'uniform', 'scale': 1.0, 'seed': None, 'mode': 'fan_avg'}}, 'name': 'dense_1', 'kernel_constraint': None, 'bias_regularizer': None, 'bias_constraint': None, 'dtype': 'float32', 'activation': 'relu', 'trainable': True, 'kernel_regularizer': None, 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 'units': 8, 'batch_input_shape': (None, 4), 'use_bias': True, 'activity_regularizer': None}}, {'class_name': 'Dense', 'config': {'kernel_initializer': {'class_name': 'VarianceScaling', 'config': {'distribution': 'uniform', 'scale': 1.0, 'seed': None, 'mode': 'fan_avg'}}, 'name': 'dense_2', 'kernel_constraint': None, 'bias_regularizer': None, 'bias_constraint': None, 'activation': 'softmax', 'trainable': True, 'kernel_regularizer': None, 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 'units': 3, 'use_bias': True, 'activity_regularizer': None}}]
# Getting layer configuration
get_layer(model, index = 1)
## <keras.layers.core.Dense>
# Listing model's layers
model$layers
## [[1]]
## <keras.layers.core.Dense>
## 
## [[2]]
## <keras.layers.core.Dense>
# Listing input tensors
model$inputs
## [[1]]
## Tensor("dense_1_input:0", shape=(?, 4), dtype=float32)
# Listing the output tensors
model$outputs
## [[1]]
## Tensor("dense_2/Softmax:0", shape=(?, 3), dtype=float32)

The information produced by the ‘summary’ function shows that we have 2 dense layers with a total of 67 parameters. This is because we have:

  1. dense_1: 4 inputs x 8 units + 8 biases = 40 parameters
  2. dense_2: 8 inputs x 3 units + 3 biases = 27 parameters

An interesting point to note is that while listing information about our input and output layers, the term Tensor appears. Tensors are generalizations of vectors and matrices to potentially higher dimensions. They are common vocabulary when working with the TensorFlow library (we should not forget that Keras is a high-level API that uses TensorFlow underneath as its backend computation engine).

Compiling and Training the ML Model

Before training a model, we have to configure the learning process via the ‘compile’ function:

# Compiling the model
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
)

The ‘compile’ function takes the following parameters:

  1. loss: the objective function the training process tries to minimize
  2. optimizer: the algorithm used to update the weights from the gradients of the loss
  3. metrics: the measures reported during training and evaluation

The ‘categorical_crossentropy’ loss and ‘accuracy’ metric are pretty obvious choices for our model. For the optimizer, since our network is not very deep and the training process is quick enough, we could test different optimizers and compare the results. According to this Quora post, though, Adam is often a solid choice, but for a deeper discussion you definitely should check Andrej Karpathy’s CS231n Lecture Notes, which are quite comprehensive regarding learning and hyperparameters.

Before finally training our Multi-Layer Perceptron, we still have one more thing to do: change the encoding of the output labels to a One-Hot Encoding format. This way, for every input example the respective output label will be a vector with 3 elements, all of them zero except the one corresponding to the example’s class:

# One hot encoding the training target values
iris.training.y.onehot <- to_categorical(iris.training.y)

# One hot encoding the test target values
iris.test.y.onehot <- to_categorical(iris.test.y)

# Printing out exemplary test data after one hot
print(head(iris.test.y.onehot))
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    1    0    0
## [3,]    1    0    0
## [4,]    1    0    0
## [5,]    1    0    0
## [6,]    1    0    0
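As a sanity check, the same encoding can be reproduced in base R by indexing into an identity matrix (just a sketch, not how ‘to_categorical’ is actually implemented):

```r
# One-hot encode 0-based integer labels by picking rows of an identity matrix
one_hot <- function(y, num_classes = max(y) + 1) {
  diag(num_classes)[y + 1, , drop = FALSE]
}

one_hot(c(0, 1, 2, 1))
#      [,1] [,2] [,3]
# [1,]    1    0    0
# [2,]    0    1    0
# [3,]    0    0    1
# [4,]    0    1    0
```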

Done in a very easy way with the help of the Keras auxiliary function ‘to_categorical’! Let’s proceed with the training:

# Training the model 
history <- model %>% 
  fit(
    iris.training.X, 
    iris.training.y.onehot, 
    epochs = 200, 
    batch_size = 8, 
    validation_split = 0.2
  )
plot(history)

# Evaluate on test data and labels
score <- model %>% 
  evaluate(iris.test.X, iris.test.y.onehot, batch_size = 128)

# Print the score
print(score)
## $loss
## [1] 0.3790988
## 
## $acc
## [1] 0.7567568

As an exercise, if we had instead used only 100 epochs for training the network, we would have ended up with a higher loss and a lower accuracy. The reality is that since our dataset is small, we need to go through multiple rounds of training.

Although the loss and accuracy scores are pretty intuitive, the plot of ‘history’ is not so much. Let’s plot them separately to see if we can get a cleaner view:

# Plotting model loss for training data
plot(history$metrics$loss, main="Model Loss", xlab = "epoch", ylab="loss", col="blue", type="l")

# Plotting model loss for test data
lines(history$metrics$val_loss, col="green")

# Adding legend
legend("topright", c("train","test"), col=c("blue", "green"), lty=c(1,1))

# Plotting accuracy for training data 
plot(history$metrics$acc, main="Model Accuracy", xlab = "epoch", ylab="accuracy", col="blue", type="l")

# Plotting accuracy for validation data
lines(history$metrics$val_acc, col="green")

# Add Legend
legend("bottomright", c("train","test"), col=c("blue", "green"), lty=c(1,1))

Predicting with the Model

With the model trained and fitted to the data, we can now use our test set to predict labels (Iris flower class/species) over new inputs.

# Predicting classes for test data
classes <- model %>% 
  predict_classes(iris.test.X, batch_size = 128)
classes
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 1 2 2 2 1 1 1 1 1 1 2
## [36] 1 1

Evaluating Model Performance

We can use Keras to calculate loss and accuracy:

# Evaluating over test data and labels
score <- model %>% evaluate(iris.test.X, iris.test.y.onehot, batch_size = 128)

# Printing the score
print(score)
## $loss
## [1] 0.3790988
## 
## $acc
## [1] 0.7567568

A general way of evaluating model performance is by means of Confusion Matrices.

A Confusion Matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabeling one as another).

In the case of a binary classifier for instance, with classes named Pos and Neg, we would have:

Some definitions:

  1. Accuracy: the fraction of all examples that are classified correctly.
  2. Precision: the fraction of the examples predicted as Pos that actually are Pos.
  3. Recall: the fraction of the actual Pos examples that are predicted as Pos.

Another common representation of the Confusion Matrix is shown below:

Where,

  1. TP (True Positives): Pos examples correctly classified as Pos
  2. TN (True Negatives): Neg examples correctly classified as Neg
  3. FP (False Positives): Neg examples incorrectly classified as Pos
  4. FN (False Negatives): Pos examples incorrectly classified as Neg

Note that in this view the actual values are shown as rows whereas the predicted ones are shown as columns.

In this new nomenclature the previous metrics are:

\(Accuracy=\frac{TP+TN}{\#Examples}\)

\(Precision=\frac{TP}{TP+FP}\)

\(Recall=\frac{TP}{TP+FN}\)

Intuitively, we can say that:

  1. Precision answers: of all the examples the model labeled Pos, how many really are Pos?
  2. Recall answers: of all the examples that really are Pos, how many did the model find?

Precision and recall avoid the problem that accuracy has when dealing with skewed datasets. We should note that there is a trade-off between precision and recall.
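A quick numeric sketch with made-up counts for a binary classifier (hypothetical values, purely for illustration):

```r
# Hypothetical binary-classifier counts
TP <- 8; TN <- 86; FP <- 2; FN <- 4

accuracy  <- (TP + TN) / (TP + TN + FP + FN)  # 0.94
precision <- TP / (TP + FP)                   # 0.8
recall    <- TP / (TP + FN)                   # 0.667 (2/3)
```

Note the skewed-dataset effect: accuracy looks high, yet a third of the actual positives are missed.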

OK, that’s a lot about Confusion Matrices for binary classification problems. For a multi-class classifier (say classes A through E), the idea generalizes: each class takes the role of Pos in turn, and E_XY denotes the number of examples of actual class X classified as class Y. In these terms:

\(Accuracy=\frac{\#Correct\ Classifications}{\#Classifications}\)

\(Precision\ of\ A=\frac{TP}{TP+FP}=\frac{TP_A}{TP_A+E_{BA}+E_{CA}+E_{DA}+E_{EA}}\)

\(Recall\ of\ A=\frac{TP}{TP+FN}=\frac{TP_A}{TP_A+E_{AB}+E_{AC}+E_{AD}+E_{AE}}\)

We can calculate the Confusion Matrix for our multi-class problem with R’s ‘table’ function:

# Printing the Confusion Matrix
cm <- table(iris.test.y, classes)
cm
##            classes
## iris.test.y  0  1  2
##           0 13  0  0
##           1  0  7  0
##           2  0  9  8
# Accuracy
mean(iris.test.y == classes)
## [1] 0.7567568
# Precision for class 0
cm[1,1]/(cm[1,1] + cm[2,1] + cm[3,1])
## [1] 1
# Recall for class 0
cm[1,1]/(cm[1,1] + cm[1,2] + cm[1,3])
## [1] 1
# Precision for class 1
cm[2,2]/(cm[2,2] + cm[1,2] + cm[3,2])
## [1] 0.4375
# Recall for class 1
cm[2,2]/(cm[2,2] + cm[2,1] + cm[2,3])
## [1] 1
# Precision for class 2
cm[3,3]/(cm[3,3] + cm[1,3] + cm[2,3])
## [1] 1
# Recall for class 2
cm[3,3]/(cm[3,3] + cm[3,1] + cm[3,2])
## [1] 0.4705882
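The per-class computations above can be vectorized: with actual classes on the rows and predicted classes on the columns, precision is the matrix diagonal divided by the column sums, and recall is the diagonal divided by the row sums. A sketch that rebuilds the confusion matrix printed above:

```r
# Confusion matrix from above: actual classes as rows, predictions as columns
cm <- matrix(c(13, 0, 0,
                0, 7, 0,
                0, 9, 8), nrow = 3, byrow = TRUE)

precision <- diag(cm) / colSums(cm)   # 1.0000 0.4375 1.0000
recall    <- diag(cm) / rowSums(cm)   # 1.0000 1.0000 0.4706
accuracy  <- sum(diag(cm)) / sum(cm)  # 0.7568
```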

Fine-Tuning the Model

Even though our Neural Network is capable of learning its internal parameters, there are external ones, known as Hyperparameters.

In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training.

Sometimes it can be difficult to choose the right architecture for a Neural Network. Usually this process requires a lot of experience, because networks involve many parameters. Some of the most important ones that we can optimize are:

  1. Number of hidden layers
  2. Number of units per layer
  3. Activation functions
  4. Choice of optimizer and its learning rate
  5. Batch size and number of epochs

Even though this list is not even close to being complete, it’s still impressive how many parameters influence the network’s accuracy.

Let’s make some experiments.

Adding Layers

# Initialize the sequential model
model2 <- keras_model_sequential() 

# Add layers to model
model2 %>% 
    layer_dense(units = 8, activation = 'relu', input_shape = c(4)) %>% 
    layer_dense(units = 5, activation = 'relu') %>% 
    layer_dense(units = 3, activation = 'softmax')

# Compile the model
model2 %>%
  compile(
     loss = 'categorical_crossentropy',
     optimizer = 'adam',
     metrics = 'accuracy'
  )

# Fit the model to the data
history <- model2 %>% 
  fit(
     iris.training.X, 
     iris.training.y.onehot, 
     epochs = 200, 
     batch_size = 8, 
     validation_split = 0.2
  )
plot(history)

# Evaluate the model
score <- model2 %>% 
  evaluate(
    iris.test.X, 
    iris.test.y.onehot, 
    batch_size = 128
  )

# Print the score
print(score)
## $loss
## [1] 0.1903814
## 
## $acc
## [1] 0.9189189

Persisting the Model Final Version

# Persisting the whole model (architecture + weights + optimizer state)
save_model_hdf5(model2, "mlp-iris.h5")
#model2 <- load_model_hdf5("mlp-iris.h5")

# Persisting/restoring the weights only (a separate file, so the
# full-model file above is not overwritten)
#save_model_weights_hdf5(model2, "mlp-iris-weights.h5")
#model2 %>% load_model_weights_hdf5("mlp-iris-weights.h5")

# Persisting/restoring the architecture only, as JSON or YAML strings
#json_string <- model_to_json(model2)
#model2 <- model_from_json(json_string)
#yaml_string <- model_to_yaml(model2)
#model2 <- model_from_yaml(yaml_string)
